Enriching, Editing, and Representing Interlinear Glossed Text

نویسندگان

  • Fei Xia
  • Michael Wayne Goodman
  • Ryan Georgi
  • Glenn Slayden
  • William D. Lewis
چکیده

The majority of the world’s languages have little to no NLP resources or tools. This is due to a lack of training data (“resources”) over which tools, such as taggers or parsers, can be trained. In recent years, there have been increasing efforts to apply NLP methods to a much broader swathe of the worlds languages. In many cases this involves bootstrapping the learning process with enriched or partially enriched resources. One promising line of research involves the use of Interlinear Glossed Text (IGT), a very common form of annotated data used in the field of linguistics. Although IGT is generally very richly annotated, and can be enriched even further (e.g., through structural projection), much of the content is not easily consumable by machines since it remains “trapped” in linguistic scholarly documents and in human readable form. In this paper, we introduce several tools that make IGT more accessible and consumable by NLP researchers.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Enriching Interlinear Text using Automatically Constructed Annotators

In this paper, we will demonstrate a system that shows great promise for creating Part-of-Speech taggers for languages with little to no curated resources available, and which needs no expert involvement. Interlinear Glossed Text (IGT) is a resource which is available for over 1,000 languages as part of the Online Database of INterlinear text (ODIN) (Lewis and Xia, 2010). Using nothing more tha...

متن کامل

Xigt: extensible interlinear glossed text for natural language processing

This paper presents Xigt, an extensible storage format for interlinear glossed text (IGT). We review design desiderata for such a format based on our own use cases as well as general best practices, and then explore existing representations of IGT through the lens of those desiderata. We give an overview of the data model and XML serialization of Xigt, and then describe its application to the u...

متن کامل

Extracting Interlinear Glossed Text from LaTeX Documents

We present texigt, a command-line tool for the extraction of structured linguistic data from LTEX source documents, and a language resource that has been generated using this tool: a corpus of interlinear glossed text (IGT) extracted from open access books published by Language Science Press. Extracted examples are represented in a simple XML format that is easy to process and can be used to va...

متن کامل

Towards Creating Precision Grammars from Interlinear Glossed Text: Inferring Large-Scale Typological Properties

We propose to bring together two kinds of linguistic resources—interlinear glossed text (IGT) and a language-independent precision grammar resource—to automatically create precision grammars in the context of language documentation. This paper takes the first steps in that direction by extracting major-constituent word order and case system properties from IGT for a diverse sample of languages.

متن کامل

Mining and Migrating Interlinear Glossed Text

1.0 Introduction The World Wide Web is rapidly becoming the primary source for disseminating data on the world’s languages. Language researchers, linguists and language communities are regularly posting a variety of language data to the Web, including lexicons, teaching materials, language recordings and transcriptions, language descriptions, and grammars. Also posted in large quantities are sc...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015